Show the code
pacman::p_load(jsonlite, tidygraph, ggraph, visNetwork, gralayouts, ggforce, tidytext, tidyverse, skimr, patchwork, ggdist, ggridges, ggthemes, scales)Shaun Tan
June 3, 2023
June 18, 2023
The country of Oceanus has sought FishEye International’s help in identifying companies possibly engaged in illegal, unreported, and unregulated (IUU) fishing. As part of the collaboration, FishEye’s analysts received import/export data for Oceanus’ marine and fishing industries. However, Oceanus has informed FishEye that the data is incomplete. To facilitate their analysis, FishEye transformed the trade data into a knowledge graph. Using this knowledge graph, they hope to understand business relationships, including finding links that will help them stop IUU fishing and protect marine species that are affected by it. FishEye analysts found that node-link diagrams gave them a good high-level overview of the knowledge graph. However, they are now looking for visualizations that provide more detail about patterns for entities in the knowledge graph
Use visual analytics to identify anomalies in the business groups present in the knowledge graph. Limit your response to 400 words and 5 images.
mc3_nodes <- as_tibble(mc3_data$nodes) %>%
# distinct() %>%
mutate(country = as.character(country),
id = as.character(id),
product_services = as.character(product_services),
revenue_omu = as.numeric(as.character(revenue_omu)),
type = as.character(type)) %>%
select(id, country, type, revenue_omu, product_services)| Name | mc3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The are no missing variables in the mc3_edges dataframe.
We next take a glimpse of the first few lines of the dataframe using the following datatable.
This dataframe depicts the connections between companies and company contacts/beneficial owners.
To further explore the count of each type of contact, we plot a bar chart of the type of edges in the mc3_edges dataframe.

There are significantly more beneficial owners (16792) vs company contacts (7244)
| Name | mc3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
There are 21515 missing values in the revenue_omu column. Given the importance of this data, and the huge number of missing values, it rouses suspicion and will be explored in greater detail subsequently.
Next, a bar chart of the type of nodes is plotted.
Next, we explore explore Country of Origin of the nodes for interesting relationships, comparing between Companies vs. Company contacts vs. Beneficial owners.
plot_company <- nodes_country %>%
filter(type == "Company" &
count > 150) %>%
ggplot(aes(x = reorder(country, -count), y = count)) +
geom_col() +
ylim(0,4000) +
geom_text(
aes(label = count),
vjust = -2
) +
labs(
title = "Count of Company's Country of Origin", y= "Count", x = "Country", subtitle = "Companies predominantly from ZH, Oceanus, and Marebak"
)
plot_owner <- nodes_country %>%
filter(type == "Beneficial Owner") %>%
ggplot(aes(x = reorder(country, -count), y = count)) +
geom_col() +
ylim(0,14000) +
geom_text(
aes(label = count),
vjust = -2
) +
labs(
title = "Count of Beneficial Owner's Country of Origin", y= "Count", x = "Country", subtitle = "Beneficial Ownders predominantly from ZH"
)
plot_contacts <- nodes_country %>%
filter(type == "Company Contacts") %>%
ggplot(aes(x = reorder(country, -count), y = count)) +
geom_col() +
ylim(0,8000) +
geom_text(
aes(label = count),
vjust = -2
) +
labs(
title = "Count of Company Contacts' Country of Origin", y= "Count", x = "Country", subtitle = "Company Contact predominantly from ZH"
)
plot_company / plot_owner / plot_contacts
Despite most of the owners and and company contacts orginating from ZH, there count of country of origin of the company is more diverse, with ZH, Oceanus, and Marebak taking the top 3 spots. It is indicative of owners venturing out of their own countries to set up companies in other countries.
Revenue is postulated to provide substantial information on the company, and would be instrumental in providing clues to any anomalous behaviours. A boxplot with frequency dot plot overlaid is plotted to visualise the distribution.
company_revenue <- mc3_nodes %>%
filter(type == "Company")
ggplot(company_revenue,
aes(y = revenue_omu)) +
scale_y_continuous(
limits = c(0, 200000),
breaks = pretty_breaks(n = 5),
labels = dollar_format())+
geom_boxplot(width = 0.5,
outlier.shape = NA, color = 'darkred') +
stat_dots(color = 'blue') +
coord_flip() +
labs(
title = "Distribution of Revenue of Companies", y= "Revenue", x = "Count", subtitle = "Highly right skewed distribution of companies' revenue"
)
The chart above shows that a majority of the companies make less than $25,000 in revenue and the distribution of the revenue of companies os right skewed.
Getting the number of owners each company has:
Exploring the datatable of edges by companies:
The table above displays the number of owner each company has. It is postulated that companies with multiple beneficial owners has the oversight of many people, it is unlikely to be engaged in dubious activities, whereas companies which are sole proprietorships are at the bidding of that single beneficial owner.
filtered_data <- owner_count_df %>%
filter(owner_count <= quantile(owner_count, 0.2))
ggplot(filtered_data, aes(x = factor(owner_count), y = count)) +
geom_col() +
ylim(0,7000) +
geom_text(
aes(label = count),
vjust = -1,
size = 3
) +
labs(
title = "Count of Companies by number of owners", y= "Count of Companies", x = "Number of Owners", subtitle = "Majority of companies have only one owner"
)
The above graph shows that 6415 companies are sole proprietorships and form the majority of the companies.
Initial Sensing:
Beneficial owners who are sole owners of multiple companies wield disproportionate influence with having more autonomy without the scrutiny of other beneficial owners.
Companies with unreported revenue yet with many company contacts - shows that the company is extensive, and could be reporting high revenue, yet the revenue is unaccounted for.
Given that the dataset contains companies engaged in many other types of products/services besides fish-related services, we should narrow down the search to just fish related companies to further explore anomalous behaviours.
The first anomalous behaviour that will be investigated would be sole beneficial owners of multiple companies, with companies having higher revenue being more suspicious. The reason being, there is no need for transparency and being accountable to shareholders for these companies. As such, they are less deterred from pursuing illegal activities given the lesser oversight.
Wrangling the edges and nodes dataframe to enable the creation of a network graph.
single_bowner_count_revenue <- left_join(single_bowner_count, mc3_nodes, by = c("source"="id")) %>%
select(-type.y) %>%
rename("type" = "type.x")
single_bowner_count_revenue1 <- single_bowner_count_revenue %>%
distinct() %>%
rename("from" = "source",
"to" = "target")
bowner_source <- single_bowner_count_revenue1 %>%
distinct(from) %>%
rename("id" = "from")
bowner_target <- single_bowner_count_revenue1 %>%
distinct(to) %>%
rename("id" = "to")
bowner_nodes_extracted <- rbind(bowner_source, bowner_target)
bowner_nodes_extracted$group <- ifelse(bowner_nodes_extracted$id %in% single_bowner_count_revenue$source, "Company", "Beneficial Owner")Creating the visNetwork Graph
visNetwork(
bowner_nodes_extracted,
single_bowner_count_revenue1
) %>%
visIgraphLayout(
layout = "layout_with_fr"
) %>%
visGroups(groupname = "Company",
color = "lightblue") %>%
visGroups(groupname = "Beneficial Owner",
color = "yellow") %>%
visLegend() %>%
visEdges(
arrows = "to"
) %>%
visOptions(
highlightNearest = list(enabled = T, degree = 2, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE)With the possibility of lesser oversight, there is the chance that these companies may be participating in suspicious activity. However, it should be noted that the size as well and the fidelity of the nature of business of these companies is not available on the network graph. It should therefore be investigated in greater detail, before a solid conclusion can be formed. However, in the meantime, it is the exception and not the norm and should be monitored.
The next suspicious behaviour that deserves investigating would be companies with many company contacts, that have missing revenue reported. There is a chance that these companies are in fact bring in substantial revenue but have undeclared their revenue.
Wrangling the edges and nodes dataframe to enable the creation of a network graph.
# Extract nodes that have unreported revenue
nodes_norev <- mc3_nodes %>%
filter(is.na(revenue_omu))
nodes_norev_compcontact <- nodes_norev %>%
filter(type == "Company Contacts") %>%
distinct()
# Extracting edges that are company contacts
edges_norev <- mc3_edges %>%
filter(type == "Company Contacts") %>%
filter(source %in% nodes_norev_compcontact$id) %>%
distinct() %>%
rename("from" = "source",
"to" = "target")
# Extract edges that have more than or equal to 3 company contacts
edges_norev_high <- edges_norev %>%
group_by(from) %>%
mutate(count = n()) %>%
filter(count >= 3) %>%
ungroup()Creating the visNetwork graph
visNetwork(
nodes_norev1,
edges_norev_high
) %>%
visIgraphLayout(
layout = "layout_with_fr"
) %>%
visGroups(groupname = "Company",
color = "lightblue") %>%
visGroups(groupname = "Company Contact",
color = "yellow") %>%
visLegend() %>%
visEdges(
arrows = "to"
) %>%
visOptions(
highlightNearest = list(enabled = T, degree = 2, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE)It is indeed suspicious that so many large companies have unreported revenue. The potulated size, given the lack of revenue data, can only be extrapolated using the number of contacts, and the number of beneficial owners. It should be noted that the number of contacts of these top few companies are similar to those of the top few companies with reported revenue. This should be investigated in further details with other forms of proxy information obtained and pieced together to determine what they can possibly be hiding.
The last visual that i would be using to explore specifically fish-related anomalous behaviour would be the networks of biggest company-beneficial owner relationships of fish-related businesses. The reason that this is done is to compare and understand the typical network size of a fish-related business in terms of number of beneficial owners, and compare it with the industry standard.
Wrangling the edges and nodes dataframe to enable the creation of a network graph.
# Extract nodes that are fish-related
fish_nodes <- mc3_nodes %>%
filter(grepl("fish", product_services, ignore.case = TRUE))
fish_nodes_bowners <- fish_nodes %>%
filter(type == "Beneficial Owners") %>%
distinct()
fish_nodes_companies <-fish_nodes %>%
filter(type == "Company") %>%
distinct()
# Extract edges that are fish related
edges_fish <- mc3_edges %>%
filter(type %in% c("Company", "Beneficial Owner")) %>%
filter(source %in% fish_nodes$id) %>%
distinct() %>%
rename("from" = "source",
"to" = "target")
# Extract edges that have more than or equal to 8 links
edges_fish_high <- edges_fish %>%
group_by(from) %>%
mutate(count = n()) %>%
filter(count >= 8) %>%
ungroup()Creating the visNetwork Graph
visNetwork(
nodes_fish1,
edges_fish
) %>%
visPhysics(solver = "forceAtlas2Based",
forceAtlas2Based = list(gravitationalConstant = -100)) %>%
visIgraphLayout(
layout = "layout_with_fr"
) %>%
visGroups(groupname = "Company",
color = "yellow") %>%
visGroups(groupname = "Beneficial Owner",
color = "lightblue") %>%
visLegend() %>%
visEdges(
arrows = "to"
) %>%
visOptions(
highlightNearest = list(enabled = T, degree = 2, hover = T),
nodesIdSelection = TRUE,
selectedBy = "group",
collapse = TRUE)It is interesting to note that there are no personnel that are beneficial owners of more than 1 companies for these “large” or extensive companies. There is a possibility that they do not want to have multiple owners for fear of a conflict of interest or corporate espionage. While this is not necessarily anomalous behaviour, it is an interesting point to note, and it may even be representative of the different cartels in the fish industry.
Exploring the networks between the various type of nodes or players in the space has been useful to visualising the relationships between the different parties. It has yielded interesting insights on how certain companies may wield more autonomy, or how certain companies, despite being expansive, may be hiding behind unreported revenues.
For future work, the additional column of product services can be sorted and analysed using text analytics methods, to provide an additional layer of to the overall visualisation of networks and information in this project.